Part of the DGfS PhD students’ forum, 23 February 2021
Instructors: Kyla McConnell and Julia Müller
Contact us on Twitter (@McconnellKyla, @JuliaMuellerFr)
Unless indicated, artwork is by the wonderful @allison_horst - find her on github.
Let’s jump right in and load the package:
library(tidyverse)
The tidyverse is an extremely useful collection of R packages (i.e. add-ons to base-R) that help you get your data into a useful format (among other things).
The tidyverse
The following packages are included in the tidyverse:
- ggplot2: for data visualisation
- tibble: for tweaked dataframes
- tidyr: for data wrangling
- readr: for reading in files
- purrr: for functional programming
- dplyr: for data manipulation
- stringr: for string manipulation
- forcats: for working with categorical variables (factors)
Characteristics of tidy data
What does tidy data look like? Each variable has its own column, each observation has its own row, and each value has its own cell.
Why use tidy data?
- a lot of wrangling commands are based on the assumption that your data is tidy
- it is expected for many statistical models
- it works best for plotting
- “Tidy datasets are all alike, but every messy dataset is messy in its own way” (Hadley Wickham)
R will often “talk” to you when you’re running code. For example, when you install a package, it’ll tell you e.g. where it is downloading a package from, and when it’s done. Similarly, when you loaded the tidyverse collection of packages, R listed them all. That’s nothing to worry about!
When there’s a mistake in the code (e.g. misspelling a variable name, forgetting to close quotation marks or brackets), R will give an error and be unable to run the line. The error message will give you important information about what went wrong.
hello <- "hi"
Hello # running this errors with "object 'Hello' not found" - R is case-sensitive!
In contrast, warnings are shown when R thinks there could be an issue with your code or input, but it still runs the line. R is generally just telling you that you MIGHT be making a logical error, not that the code is impossible.
c(1, 2, 3) + c(1, 2)
We’ll get to know a number of R functions today. These functions can take one or more arguments. As an example, let’s try out the say() function from the cowsay package.
First, (install and) load the cowsay package:
# install.packages("cowsay")
library(cowsay)
Try the following code:
say(
what = "Good luck learning R!",
by = "rabbit")
We can see that this function has the what argument (what should be said?) and the by argument (which animal should say it?). But what other options are there for this command - which other animals, for example, or can you change the colour? To see the documentation for the say command, you can either run this line of code:
?say
…or type in say in the Help tab on the bottom right.
This will show you the documentation for the command.
If you run say() without any arguments in the brackets, you get the defaults, i.e. a cat saying “Hello world!”.
The Arguments section provides more information on each argument. Arguments are the options you can use within a function:
- what
- by
- type
- what_color etc.
Each of these can be fed to the say() function to slightly alter what it does.
The Examples section at the bottom of the help page lists a few examples you can copy-paste into your code to better understand how a function works.
Don’t worry if you don’t understand everything in the documentation when you’re first starting out. Just try to get an idea for which arguments there are and which options for those arguments. It’s good practice to look at help documents often – this will also help you get more efficient at extracting the info you need from them.
What type of file are you working with? Specifically, what’s used to separate the values in the different columns? To find out, open the file in a text editor. Common options are either commas or semicolons (in which cases, the file often has the ending .csv) or tabs (often .txt files).
Where is the file saved? If you are working in an R Markdown file and the data file is saved in the same location as the script file, you can use the name of the file with its ending, e.g. “new_data.csv” or “final_data.txt”. This is possible because RMarkdown automatically sets the working directory, i.e. the location where R tries to find the file, to where the script is saved. This is very convenient, especially when you share your script and data: your collaborator doesn’t need to type in any long paths but can directly start working. If the file is saved in a subfolder called “data”, use “data/example_file.csv”. ../ lets you go up one folder.
If you’re working in a script, use setwd() (e.g.: setwd("~/Documents/PhD/Rscripts/...") or setwd("C:/Documents...")), or Session -> Set Working Directory -> Choose Directory…
The commands to read in a file are:
- base-R: read.csv() and read.delim() (for tab-separated files)
- tidyverse (improvements, reads as tibble instead of dataframe): read_csv() and read_tsv() - must have tidyverse installed and loaded with a library() call
- read_csv2() for semicolon-separated csv
- read_delim(file, delim) for any other delimiters
Save the output to a variable using <-
So reading in data tends to follow this pattern:
name_of_data_in_R <- read_csv("data_file.csv") # equivalent to
name_of_data_in_R <- read_delim("data_file.csv", delim = ",")
name_of_data_in_R <- read_csv2("data_file.csv") # equivalent to
name_of_data_in_R <- read_delim("data_file.csv", delim = ";")
name_of_data_in_R <- read_tsv("data_file.txt") # tab-separated file
We’ve tried to tailor the workshop to be relevant to the kinds of data many of you said you used in the pre-workshop survey, so we’ll start with self-paced reading data (which looks fairly similar to eye-tracking data, another common response), and later use examples of corpus data and a questionnaire output. However, everything we discuss is useful for any kind of data! We’re also providing several files for you to practise on later, so you can pick data that looks closest to what you’re actually working with.
Let’s read in a small self-paced reading dataset (saved in the data folder, so we need to add data/ to tell R that):
spr <- read_csv("data/dgfs_spr.csv")
Parsed with column specification:
cols(
X1 = col_double(),
participant = col_character(),
item_type = col_character(),
sentence_num = col_double(),
cond = col_character(),
word = col_character(),
RT = col_double(),
full_sentence = col_character(),
word_num = col_double()
)
The current dataframe is from a self-paced reading experiment in which 12 participants read 20 sentences each, plus 3 practice sentences to get them warmed up. Half the sentences were about dogs and half were about cats. In one condition (A), all sentences were paired with appropriate adjectival collocates according to the BNC (lap dog vs. tortoiseshell cat); in the other, these were reversed (lap cat vs. tortoiseshell dog). All items were otherwise natural-sounding sentences.
Now you have a data file read in, but how do you see what’s in it?
head(spr)
You can change the number of rows with the n argument:
head(spr, n=3)
Or: click the name of the dataframe in the Environment tab. There, you can also sort columns and filter rows - just for viewing purposes. This gets a bit slow once you have huge dataframes, but it’s often a good first look.
There’s also an easy way to see what the columns are:
colnames(spr)
summary(): call it on a dataframe to get each column and useful info based on the data type. For example, numeric columns will show the min, median, max and the quartiles (25% increments).
summary(spr)
summary(spr$RT)
Read in and explore the example data. How many rows and columns does it have?
In the environment panel (or using str()), you can see that all the variables in this data are read in as either numeric or character data. However, some variables should be treated as factors because they represent categories, not text data. Let’s convert them.
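A minimal base-R sketch of that conversion (the stand-in data below is hypothetical; in the workshop, spr comes from the read_csv() call above):

```r
# stand-in for the workshop data (hypothetical values)
spr <- data.frame(participant = c("p01", "p02", "p01"),
                  cond = c("cond_A_cat", "cond_B_dog", "cond_B_dog"))

# these columns represent categories, so convert them from character to factor
spr$participant <- as.factor(spr$participant)
spr$cond <- as.factor(spr$cond)
```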
This way, the summary output also works as expected. For example, we can have a look at how many different participants and conditions there are:
summary(spr$participant)
summary(spr$cond)
If any other columns in your dataframe are read in wrong (for example, if you have a numeric column that looks like: “43”, “18” and is being read as a character column) you can convert them with similar syntax: as.numeric(), as.character() etc.
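For example, a small sketch with toy values:

```r
ages <- c("43", "18")   # numbers stored as character strings
as.numeric(ages)        # converts them to proper numbers: 43 18
```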
One of the most noticeable features of the tidyverse is the pipe %>% (keyboard shortcut: Ctrl/Cmd + Shift + M).
The pipe takes the item before it and feeds it to the following command as the first argument. Since all tidyverse (and some non-tidyverse) functions take the dataframe as the first argument, this can be used to string together multiple functions.
So to re-write the head() function with the pipe:
spr %>%
head()
This produces the exact same output as head(spr). Why would this be useful?
You can see that the version with the pipe is easier to read when more than one function is called on the same dataframe!
Here are some more examples:
# Equivalent to summary(spr)
spr %>%
summary()
# Equivalent to colnames(spr), which returns all column names
spr %>%
colnames()
# Equivalent to nrow(spr), which returns the number of rows in the df
spr %>%
nrow()
And you can also stack commands by ending each row (except the last one) with a pipe:
spr %>%
colnames() %>% #extracts column names
nchar() #counts the number of letters
You can rename columns with the rename() function. The syntax is new_name = old_name. Let’s rename the cond variable:
spr %>%
rename(condition = cond)
This is just a preview because we didn’t assign the changed dataframe to any name. This is useful for testing code and making sure it does what you expect and want it to do.
If you look at the spr dataframe, for example in the Environment panel on the upper right, you’ll see that it hasn’t changed. To save your changes, assign your call back to the variable name. Good workflow: preview first, then save when you’re sure you’re happy with the output.
You can also rename multiple columns at once (no need for an array here):
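The pattern is rename(new1 = old1, new2 = old2, ...). A sketch, renaming cond to condition and saving the result back over spr (the stand-in data below is hypothetical):

```r
library(dplyr)

# stand-in for the workshop data (hypothetical rows)
spr <- tibble(cond = c("cond_A_cat", "cond_B_dog"), RT = c(350, 420))

# rename(new = old, ...) accepts several pairs at once
spr <- spr %>%
  rename(condition = cond)
```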
Notice above that I’ve saved output over the spr dataframe to make the changes ‘permanent’.
There is no output when you simply save the response, but the spr dataframe has been permanently updated (within the R session, not in your file system)
If you make a mistake: the button with an arrow and a line under it at the top of an R Markdown code block runs all blocks above (but not the current one), so you can recreate the dataframe from scratch.
There’s also a corresponding command that lets you sort by row values: arrange(). By default, this sorts by lowest to highest value, but you can add desc() to reverse that.
spr %>%
arrange(RT)
spr %>%
arrange(desc(RT))
The traditional syntax for accessing columns is dataframe$column. A useful step in using pipes and tidyverse calls is the ability to *select* specific columns. That is, instead of writing spr$RT we can write:
spr %>%
select(RT)
You can also use select() to take multiple columns.
spr %>%
select(participant, word, RT)
You can see that these columns are presented in the order you gave them to the select call, too:
spr %>%
select(RT, word, participant)
You can also use this to reorder columns: give the name of the column(s) you want first, then finish with everything() so that all other columns follow:
spr %>%
select(RT, everything())
Above, we’re mostly previewing using select() for first insights. If you look at the spr dataframe, for example in the Environment panel on the upper-right, the dataframe hasn’t changed. To save your changes, assign your call back to the variable name, i.e. df <- df %>% some operations here
You can also remove columns using select if you use the minus sign. For example, the item_type column is a factor with only one level - it always says “DashedSentence”. So let’s get rid of it:
spr %>%
select(-item_type)
You can also remove multiple columns at once by writing them in an array c(). We’d like to remove the item type column and also the first column (X1) which seems to be just a counter.
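One possible sketch of that call (the stand-in rows below are hypothetical):

```r
library(dplyr)

# stand-in for the workshop data (hypothetical rows)
spr <- tibble(X1 = 1:2,
              item_type = c("DashedSentence", "DashedSentence"),
              word = c("lap", "dog"),
              RT = c(350, 420))

# drop both columns at once with -c(...), and save the output
spr <- spr %>%
  select(-c(X1, item_type))
```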
This overwrites the data as it is saved in R. It does not overwrite the file that is saved on your computer.
Until now, we’ve used select() in combination with the full column name, but there are helper functions that let you select columns based on other criteria.
For example, here’s how we can select both the sentence_num and the word_num column - by specifying ends_with("_num") in the select() call:
spr %>%
select(ends_with("_num"))
The opposite is also possible using starts_with()
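For example (stand-in data; in the spr data, both word and word_num start with “word”):

```r
library(dplyr)

# stand-in for the workshop data (hypothetical rows)
spr <- tibble(word = c("lap", "dog"),
              word_num = c(3, 4),
              RT = c(350, 420))

spr %>%
  select(starts_with("word"))   # keeps word and word_num
```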
contains is another helper function. Here, we’re using it to show all columns that contain an underscore:
spr %>%
select(contains("_"))
We can also select a range of variables using a colon. This works both with variables and (a range of) numbers:
spr %>%
select(condition:word) # all columns from condition through word
spr %>%
select(1:3) # first three columns
Here, the order of the columns matters!
Other helper functions are:
- matches(): similar to contains(), but can use regular expressions
- num_range(): in a dataset with the variables X1, X2, X3, X4, Y1, and Y2, select(num_range("X", 1:3)) returns X1, X2, and X3
While with select(), you can pick columns by name or if they fulfill conditions, filter() lets you look for rows that fulfill certain conditions.
Filter
Use filter() to return all items that fit a certain condition. For example, you can use:
- equals to: ==
- not equal to: !=
- greater than: >
- greater than or equal to: >=
- less than: <
- less than or equal to: <=
- in (i.e. in a vector): %in%
Syntax: filter(data, columnname logical-operator condition) or, using the pipe: data %>% filter(columnname logical-operator condition)
Let’s look at reaction times that are shorter than 200 ms:
spr %>%
filter(RT < 200)
…reaction times longer than or equal to 250 ms:
spr %>%
filter(RT >= 250)
Or you can use it to select all items in a given category. Notice here that you have to use quotation marks to show you’re matching a character string. Look at the error below:
spr %>%
filter(word == relative)
The correct syntax is: (because you’re matching to a string)
spr %>%
filter(word == "relative")
You can also use filter to easily drop rows. Let’s drop all practice rows and save the output.
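How practice items are flagged depends on your data; assuming (hypothetically) that the cond column contains the value “practice” for them, the call would look like this:

```r
library(dplyr)

# stand-in for the workshop data; the "practice" label is an assumption
spr <- tibble(cond = c("practice", "cond_A_cat", "cond_B_dog"),
              RT = c(500, 350, 420))

# keep everything that is NOT a practice row, and save the output
spr <- spr %>%
  filter(cond != "practice")
```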
To use %in%, give an array of options (formatted in the correct way based on whether the column is a character or numeric):
spr %>%
filter(word %in% c("cat", "dog"))
spr %>%
filter(sentence_num %in% c(2, 4))
Note that filter is case-sensitive, so capitalization matters.
We can also specify several conditions in one filter() call, e.g.
spr %>%
filter(word == "relative" & RT > 200)
spr %>%
filter(RT > 300 | RT < 150)
We can also “chain” different functions, which is one of the things that makes the pipe so useful. For example, we could filter for data in condition B (the mismatch condition) and only look at the words and their response times:
spr %>%
filter(condition == "cond_B_dog") %>%
select(word, RT)
One useful function that can be chained to filter() is distinct(), which will return only the unique rows. Without an argument, it returns all rows that are unique in all columns.
You can also add a column name as an argument to return only the unique values in a certain column (useful with factors)
spr %>%
filter(condition == "cond_B_dog") %>%
distinct(full_sentence)
You can also use it on its own to return unique values or combinations of values.
spr %>%
distinct(full_sentence, condition)
For the next two examples, we’ll use a different dataset called animal_corpus. Read it in and familiarise yourself with its contents.
animal_corpus <- read_csv2("data/cat_dog_corpus_data.csv")
This contains corpus data of “cat” and “dog” together with the words that precede “cat” and “dog”. These are called “collocates” and, in our example, also include part of speech tags (the format is word_tag).
This data is not tidy. Why?
The collocates column contains two variables (word and tag) although according to tidy data principles, each variable should be saved in its own column.
Luckily, the tidyverse has a command for that: separate(). It takes the following arguments:
- data: our dataframe, we’ll pipe it
- col: which column needs to be separated
- into: a vector that contains the names of the new columns
- sep: which symbol separates the values
- remove (optional): by default, the original column will be deleted. Set remove to FALSE to keep it.
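A sketch of the call for our data (the new column names coll_word and coll_tag are our choice; the stand-in values below are hypothetical):

```r
library(tidyr)
library(dplyr)

# stand-in for the corpus data: collocates in word_tag format
animal_corpus <- tibble(collocates = c("purring_ADJ", "guide_N"),
                        animal = c("cat", "dog"))

animal_corpus %>%
  separate(col = collocates,
           into = c("coll_word", "coll_tag"),
           sep = "_")
```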
Another example: In the SPR data, the condition column contains two pieces of information: which condition the item was in (condA or condB) and which animal was being read about (cat or dog)
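With labels like “cond_B_dog” (two underscores), separate() produces three pieces; a sketch (the new column names here are our choice, and the stand-in rows are hypothetical):

```r
library(tidyr)
library(dplyr)

# stand-in for the workshop data
spr <- tibble(condition = c("cond_A_cat", "cond_B_dog"))

spr %>%
  separate(col = condition,
           into = c("prefix", "cond_letter", "animal"),
           sep = "_")
```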
unite() is the opposite of separate(). This lets you glue columns together.
- col is the name of the new column
- the next argument, a vector, lists the columns that should be united
- sep, as above, lets you specify how the values should be separated
Let’s say we’d like our data to be in the format “collocate cat/dog”, so without the tag, but in one column, separated by a space.
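A sketch: first separate as above, then glue the collocate and the animal together with a space (stand-in data; the animal column name is an assumption):

```r
library(tidyr)
library(dplyr)

# stand-in for the corpus data
animal_corpus <- tibble(collocates = c("purring_ADJ", "guide_N"),
                        animal = c("cat", "dog"))

animal_corpus %>%
  separate(collocates, into = c("coll_word", "coll_tag"), sep = "_") %>%
  unite(col = coll_word, c(coll_word, animal), sep = " ")
```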
This leads to a lot of repetition - some bigrams appear several times in just our preview. To remedy this, we can use distinct(), which only keeps unique rows. To make clear that we want unique collocate cat/dog combinations, we can put coll_word into distinct() to make clear that this is the relevant column and tags should be ignored.
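Continuing that pipeline (stand-in data with a duplicated bigram to show the effect):

```r
library(tidyr)
library(dplyr)

# stand-in for the corpus data, with one repeated bigram
animal_corpus <- tibble(collocates = c("purring_ADJ", "purring_ADJ", "guide_N"),
                        animal = c("cat", "cat", "dog"))

animal_corpus %>%
  separate(collocates, into = c("coll_word", "coll_tag"), sep = "_") %>%
  unite(col = coll_word, c(coll_word, animal), sep = " ") %>%
  distinct(coll_word)   # keeps each bigram only once
```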
With the mutate() function, you can change existing columns and add new ones. The syntax is: mutate(data, col_name = some_operation) or, with the pipe: data %>% mutate(col_name = some_operation).
Mutate
The response times are measured in ms. Let’s convert them to seconds by dividing by 1000:
(spr <- spr %>%
mutate(RT_s = RT / 1000))
Now, there’s a new column called RT_s (it’s at the very end by default).
You can also save the new column with the same name, and this will update all the items in that column (see below, where I divide response times by 1000, but note that I don’t save the output):
spr %>%
mutate(RT = RT / 1000)
You can also do operations to character columns - for example:
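For example (a sketch on stand-in data), toupper() changes case and paste() glues text onto each value:

```r
library(dplyr)

# stand-in for the workshop data
spr <- tibble(word = c("lap", "dog"))

spr %>%
  mutate(word_upper = toupper(word),
         word_tagged = paste(word, "(target)"))
```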
We can also change data types using mutate(). Instead of the code we used earlier to convert participant and condition to factors, we could write:
spr <- spr %>%
mutate(participant = as_factor(participant),
item_type = as_factor(item_type),
condition = as_factor(condition))
Error: Problem with `mutate()` input `item_type`.
x Object 'item_type' not found
i Input `item_type` is `as_factor(item_type)`.
This error appears because we deleted the item_type column earlier - simply drop that line from the mutate() call and it runs. As you can see, we can change several variables within one mutate() call. In the same way, we could create several new columns at the same time. Here, dropping each new column onto a new line is considered good style and makes the code more readable (but it’s not necessary).
In our SPR experiment, condition A represents a match (i.e. cat/dog is presented with a matching collocate: purring cat, guide dog) and condition B is a mismatch (e.g. guide cat, purring dog). To make this clear in the data, we should label this explicitly. Within a mutate() command, we can use recode() to change the factor labels. The format for this is old label = “new label”.
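A sketch of the recode() call, assuming the factor levels are cond_A and cond_B (adjust to your actual labels):

```r
library(dplyr)

# stand-in for the workshop data; levels cond_A/cond_B are an assumption
spr <- tibble(condition = factor(c("cond_A", "cond_B", "cond_A")))

spr <- spr %>%
  mutate(condition = recode(condition,
                            cond_A = "match",
                            cond_B = "mismatch"))
```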
Let’s look at our third dataset, animal_survey. It contains data from the same participants, who answered a few sociodemographic questions and also indicated how cute they think animals are and how much they(’d) like to look at, pet, and own an animal (scale: 1-5). Besides that, they were also asked to rate cats and dogs (scale: 1-7).
animal_survey <- read_csv("data/animal_survey.csv")
In our animal_survey data, education is represented by numeric codes, 1-4. We could turn these into labels like so:
Because these numbers should be treated as characters, we need to put them in quotation marks!
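A sketch with entirely hypothetical labels - replace them with the categories from your own survey. Since education is numeric here, the old values go in backticks; if your column is character (“1”, “2”, …), use quotation marks instead:

```r
library(dplyr)

# stand-in for the survey data; the labels below are made up
animal_survey <- tibble(education = c(1, 3, 4, 2))

animal_survey %>%
  mutate(education = recode(education,
                            `1` = "secondary school",
                            `2` = "high school",
                            `3` = "BA",
                            `4` = "MA"))
```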
Now for something fancy. You can also make new columns based on “if” conditions using the call ifelse(). The syntax of ifelse is: ifelse(this_is_true, this_happens, else_this_happens). For example, we could create a column called “RT_short” that contains “short” if the response time is faster than 100 ms and “long” if it isn’t:
spr %>%
mutate(RT_short = ifelse(RT < 100, "short", "long")) %>%
select(RT, RT_short)
You can also use ifelse on categorical / character columns:
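For example (stand-in data), we could label the animal words versus everything else:

```r
library(dplyr)

# stand-in for the workshop data
spr <- tibble(word = c("the", "lap", "cat", "dog"))

spr %>%
  mutate(is_animal = ifelse(word %in% c("cat", "dog"), "animal", "other"))
```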
What if you have several conditions? For example, if RTs are shorter than 100 ms, they should be labelled “short”; if they’re longer than 500 ms, “long”; and “normal” for all other RTs. While it’s possible to nest several ifelse() statements, that gets confusing and hard to read. Instead, we should use case_when().
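A sketch of the case_when() version of that rule (stand-in data):

```r
library(dplyr)

# stand-in for the workshop data
spr <- tibble(RT = c(80, 300, 600))

spr %>%
  mutate(RT_category = case_when(
    RT < 100 ~ "short",
    RT > 500 ~ "long",
    TRUE ~ "normal"     # everything else
  ))
```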